Exploratory Data Analysis of Airbnb Accommodation in Copenhagen
An Exam Project in Business Data Processing and Business Intelligence
- 2.0 INTRODUCTION
- 3.0 METHODOLOGY
- 4.0 PROCESSING THE DATA
- 5.0 EXPLORATORY DATA PROCESSING
- Visualizations
1.0 ABSTRACT
Topic
Problem formulation
Research question
Concepts
Dataset and main data analytics methods and tools
Most important results
Conclusions and recommendations
2.0 INTRODUCTION
Since 2008, Airbnb has grown from a small accommodation platform based in San Francisco into one that is recognised throughout the world. Airbnb has revolutionized the tourism accommodation industry by applying a sharing-economy model to it. Today, Airbnb is the world’s largest accommodation service provider, with more accommodation options than any other accommodation business, and even more than all of them combined. As a platform, Airbnb enables people (hosts) to offer accommodation to other people (guests), providing guests with a more unique and personalized way of experiencing the world, often at a lower price than other accommodation options. Airbnb captures only a fraction (20%) of these transactions, which in 2019 yielded 4.7 billion USD in sales revenue.
2.1 Problem Formulation and Research Question
Data plays a key role in Airbnb’s success. For instance, data enables Airbnb to match guests and hosts and allows users to filter the host listings to their liking with respect to price, location, number of beds, and much more. Data is thereby essential to securing high customer satisfaction. Moreover, Airbnb can use the collected data to extract insights that improve its service offerings and guide decision making, marketing initiatives, and more.
As a platform, Airbnb creates value solely by making successful matches between guests and hosts and by ensuring a positive experience for both parties. Naturally, if the platform fails to deliver a positive experience to a user, the user might abandon the platform entirely, resulting in negative feedback loops. This leads us to our research question:
How can Airbnb ensure that matches, and the experiences they create, are positive for both its customers (guests) and its providers (hosts)? Moreover, how can Airbnb help guide user decisions to create successful matches and positive experiences?
Currently, Airbnb helps users create meaningful matches by allowing guests to narrow their search for accommodation by attributes of the individual host listing. As such, users can easily find accommodation that meets their basic needs, e.g. number of beds, bedrooms, price, and room type. However, without any knowledge of the different location areas, guests might find it difficult to choose a location that suits their needs.
In this project, we will examine the accommodation services, listed on Airbnb for Copenhagen, in special regard to the location areas, and the attributes that are associated with them. The goal is to create a report that can guide customers to choose a location that lives up to their expectations, thereby improving the quality of the matches provided by the platform.
3.1 Dataset Analysis Process
To answer the research question, we perform an exploratory data analysis (EDA) to understand which features separate and define the neighbourhoods and how their accommodation offerings differ. More specifically, we examine which room types and property types are most common in each neighbourhood and how the neighbourhood affects the listing price. Moreover, the price of accommodating one person is calculated to provide an indication of the neighbourhood's wealth.
Next, we create an interactive map that displays each individual listing in a geospatial visualization. The map allows the user to easily find listings, view the most expensive listings, and see where each neighbourhood is located. Furthermore, the interactivity enables zooming, panning, and filtering of the data to improve the understanding of the geography.
Finally, we create word cloud visualizations of how hosts describe their neighbourhoods, which gives a sense of the words that best characterize each location area.
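As a rough sketch of the idea behind the word clouds, the snippet below counts the most frequent words across a few neighbourhood descriptions. The sample texts, the small stop-word set, and the helper name `word_frequencies` are made up for illustration; they are not the project's data or the stop-word list shipped with the wordcloud package.

```python
import re
from collections import Counter

# Hypothetical sample descriptions; the real ones come from the neighbourhood_overview column
descriptions = [
    "Quiet street close to the lakes and many cafes",
    "Lively area with cafes, bars and parks",
    None,  # hosts often leave the description empty
    "Close to the metro and the lakes",
]

# Tiny illustrative stop-word set (an assumption, not the wordcloud STOPWORDS list)
STOP_WORDS = {"the", "and", "to", "with", "many", "a", "of"}


def word_frequencies(texts, stop_words=STOP_WORDS):
    """Count lower-cased words across all texts, skipping missing values and stop words."""
    counts = Counter()
    for text in texts:
        if not isinstance(text, str):
            continue  # skip None / NaN entries
        words = re.findall(r"[a-zæøå]+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


frequencies = word_frequencies(descriptions)
print(frequencies.most_common(3))
```

A dictionary of counts like this can be handed to `WordCloud.generate_from_frequencies` to draw the cloud.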
3.2 Dataset Description
The data was downloaded from the independent site Inside Airbnb, which scrapes data from Airbnb and makes it publicly available for analysis. The site provides a multitude of datasets covering the most populated cities around the world, including Copenhagen.
The datasets provided by Inside Airbnb are as follows: (1) listings, (2) calendar, (3) reviews, (4) listings_summary, (5) reviews_summary.
We have downloaded and inspected all of the datasets. However, only the listings are assessed to be important for this project.
The most recent dataset is used, which was scraped on 28 November 2020.
The listings dataset contains data about the Airbnb host listings and their respective attributes. In total, there are 74 columns describing 8636 listings on the Airbnb platform. However, for this project, the following 17 attributes have been selected for analysis:
- id: primary key (listings_id)
- name: name of listing
- description: room description
- neighbourhood_overview: text description of the neighbourhood
- neighbourhood_cleansed: location area, cleaned of special characters
- latitude: latitude
- longitude: longitude
- property_type: type of property the room is located in
- room_type: type of room that is made available
- accommodates: max number of people that can stay at a time
- beds: number of beds in the room
- bedrooms: number of bedrooms in the listing
- amenities: facilities available
- price: price of the room per day
- number_of_reviews: number of times the listing was reviewed
- last_review: date of the most recent review
- review_scores_rating: average review score of the listing
3.3 Preprocessing Steps
As usual, before we can begin the data exploration, we need to preprocess the data. The preprocessing follows this structure:
- Install and Import libraries
- Gather data
- Data Cleaning
Using the pandas library, we download the data and unzip it with pandas' built-in decompression support, and then load it into dataframes that store the data and enable cleaning and manipulation. We clean the data by selecting only the 17 columns listed previously, checking for misclassified datatypes, and renaming columns and values to ease interpretation of the data.
Without further ado, let's get started!
In this section, we will process and clean the data before initiating the exploratory data analysis.
This project was created in a Colab notebook and exported to fastpages for improved readability and interactive features such as the code button provided below. You will find these buttons throughout this paper; however, some code snippets have been hidden entirely.
If you wish to review the full code, please see the buttons under the headline of this post.
import pandas as pd #used to store and manage the data
import numpy as np
import matplotlib.pyplot as plt #visualization library
import plotly.express as px #visualization library used for geospatial data
#Wordcloud related libraries
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
We start our data analysis by installing and importing the libraries for our Python interpreter. For this project, we use pandas, NumPy, matplotlib's pyplot, Plotly Express, Pillow (PIL), and, from the wordcloud package, WordCloud, STOPWORDS, and ImageColorGenerator. These libraries enable us to clean and process the data and later to create meaningful visualizations.
#Create DataFrames
listings = pd.read_csv('http://data.insideairbnb.com/denmark/hovedstaden/copenhagen/2020-11-28/data/listings.csv.gz', compression='gzip')
listings = listings.iloc[:,[0,4,5,6,27,29,30,31,32,33,36,37,38,39,55,59,60]]
The dataset is downloaded, unzipped, and stored in a dataframe using pandas. Again using pandas, the irrelevant columns are dropped so that only the columns chosen for this project remain.
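Selecting columns by position can break silently if the column order of the source file changes. One possible alternative is to select them by name with `usecols`, sketched here on a tiny in-memory CSV. The sample rows and the shortened column list are made up for illustration and are not the project's data.

```python
import io

import pandas as pd

# Tiny stand-in for listings.csv (made-up rows, only a few of the 74 columns)
csv_text = """id,name,host_id,neighbourhood_cleansed,price
1,Cozy flat,10,Indre By,"$1,200.00"
2,Sunny room,11,Vesterbro-Kongens Enghave,$450.00
"""

# Columns to keep, referenced by name rather than by position
wanted = ["id", "name", "neighbourhood_cleansed", "price"]

# usecols drops every column that is not listed (here: host_id)
sample = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(sample.columns.tolist())
```

The same `usecols` argument can be combined with `compression='gzip'` when reading the compressed file directly.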
Now that we have gathered the data, let us take a quick look at it:
listings.head(3)
listings.info()
listings.isnull().sum()
listings.shape
From this quick look, we see that not all hosts provide a description of the neighbourhood in which the listing is located (the values appear as missing, NaN). In fact, 3700 listings appear without a description of the neighbourhood.
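To put such counts in perspective, it can help to express them as a share of all rows (3700 of 8636 listings is roughly 43%). The snippet below is a minimal sketch on made-up data; the same call could be made on the listings dataframe.

```python
import numpy as np
import pandas as pd

# Made-up sample: two of four listings lack a neighbourhood description
sample = pd.DataFrame({
    "listing_name": ["A", "B", "C", "D"],
    "neighbourhood_overview": ["Close to the lakes", np.nan, np.nan, "Quiet area"],
})

# isnull() marks missing cells; the mean of the booleans is the share of missing values
missing_share = sample.isnull().mean() * 100
print(missing_share.round(1))
```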
We also find that, in the cleansed neighbourhood attribute, the letter 'ø' has been removed from the names.
The price attribute is given as a string prefixed with '$', although the listing price should be denoted in DKK. Moreover, the price is interpreted as an object instead of a numeric value. Similarly, 'bedrooms' and 'beds' are interpreted as floats instead of integers. Additionally, the last_review attribute should be classified as a datetime type.
Let's try to correct the above-mentioned data anomalies.
We start by renaming the columns, and then the neighbourhood names, to make them easier to interpret. To correct the datatypes, some steps have to happen beforehand. For the price, the '$' and ',' are removed before the datatype is corrected. For beds and bedrooms, the missing values are filled with zero, under the assumption that these listings are available but do not offer a bed or a bedroom.
#Rename columns
listings.rename(columns={'id':'listing_id','name':'listing_name','description':'listing_description', 'neighbourhood_overview':'neighbourhood_description'},inplace=True)
#Rename Neighbourhood Values
listings['neighbourhood_cleansed'] = listings.neighbourhood_cleansed.replace(
    {
        'Nrrebro': 'Nørrebro',
        'sterbro': 'Østerbro',
        'Amager st': 'Amager Øst',
        'Vanlse': 'Vanløse',
        'Brnshj-Husum': 'Brønshøj-Husum'
    }
)
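#Fill missing values with the number 0 (not the string '0') so the columns stay numeric
cols = ['beds','bedrooms']
listings[cols] = listings[cols].fillna(0)
#Correct DataTypes
listings = listings.astype(
    {
        'bedrooms':int,
        'beds':int,
        'last_review':'datetime64[ns]'
    }
)
#Strip the thousands separator and the '$' sign from the price, then correct the DataType
#regex=False treats '$' as a literal character rather than the regex end-of-string anchor
listings.price = listings.price.str.replace(',','', regex=False)
listings.price = listings.price.str.replace('$','', regex=False)
listings.price = listings.price.astype(float)
clean = listings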
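The cleaning steps can be tried out on a small made-up frame before running them on the full dataset. The sketch below mimics the raw format of the price, beds, and bedrooms columns; the rows and the names `raw` and `cleaned` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows mimicking the raw format of the price, beds and bedrooms columns
raw = pd.DataFrame({
    "price": ["$1,200.00", "$450.00", "$99.00"],
    "beds": [2.0, np.nan, 1.0],
    "bedrooms": [1.0, 1.0, np.nan],
})

cleaned = raw.copy()

# Strip the currency sign and thousands separator as literal characters, then convert
cleaned["price"] = (
    cleaned["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Missing bed and bedroom counts are assumed to mean zero, then cast to integers
cleaned = cleaned.fillna({"beds": 0, "bedrooms": 0}).astype({"beds": int, "bedrooms": int})

print(cleaned.dtypes)
```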
We should now have clean data that we can use to analyze the attributes of the listings. Here, we investigate the distributions of prices, neighbourhoods, property types, and room types. Similarly, we investigate the average prices of listings by neighbourhood, property type, and room type. In conclusion, we aim to list the differences that occur between the categories.
Towards the end, the listings and reviews datasets are merged into one dataframe that contains data about both listings and reviews. They are joined on listing_id using an inner join. We can then use this dataframe to display the interactive map and the word clouds that summarize the reviews of each neighbourhood.
neighbourhoods = listings['neighbourhood_cleansed'].value_counts().to_frame(name='listings').reset_index()
neighbourhoods
listings[['property_type']].value_counts().to_frame(name='listings').reset_index()
listings[['room_type']].value_counts().to_frame(name='listings').reset_index()
#Select properties listed more than 400 times
listings_clean = listings[listings.property_type.isin(['Entire apartment','Private room in apartment','Entire condominium','Entire house'])]
#Count number of listings in neighbourhoods by property type
listings_byNeighbourhood = listings_clean.groupby(['neighbourhood_cleansed','property_type']).neighbourhood_cleansed.count().to_frame(name = 'listings').reset_index()
#Sum number of listings per neighbourhood
listingsNeighbourhoodCount = listings_byNeighbourhood.groupby('neighbourhood_cleansed')['listings'].sum().to_frame(name = 'total_listings').sort_values(by='total_listings', ascending=False).reset_index()
#Calculate ratio of property types in the different neighbourhoods
neighbourhoodPropertyRatio = listings_byNeighbourhood.merge(listingsNeighbourhoodCount, on='neighbourhood_cleansed')
neighbourhoodPropertyRatio['ratio_of_property_type_in_neighbourhood'] = neighbourhoodPropertyRatio['listings']/neighbourhoodPropertyRatio['total_listings']*100
neighbourhoodPropertyRatio.head(10)
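A comparable ratio table can also be produced in one step with `pd.crosstab` and `normalize='index'`. The snippet is a sketch on made-up rows, offered as an alternative view rather than a replacement for the calculation above.

```python
import pandas as pd

# Made-up listings: three in Indre By, two in Nørrebro
sample = pd.DataFrame({
    "neighbourhood_cleansed": ["Indre By", "Indre By", "Indre By", "Nørrebro", "Nørrebro"],
    "property_type": [
        "Entire apartment", "Entire apartment", "Private room in apartment",
        "Entire apartment", "Entire house",
    ],
})

# Rows: neighbourhoods, columns: property types,
# values: percentage of each property type within the neighbourhood
ratio = pd.crosstab(
    sample["neighbourhood_cleansed"],
    sample["property_type"],
    normalize="index",
) * 100
print(ratio.round(1))
```

Each row sums to 100, which makes neighbourhoods of different sizes directly comparable.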
#Count number of listings in neighbourhoods by room type
roomCount = listings.groupby(['neighbourhood_cleansed','room_type']).neighbourhood_cleansed.count().to_frame(name = 'listings').reset_index()
#Sum number of listings per neighbourhood
roomNeighbourhoodCount = roomCount.groupby('neighbourhood_cleansed')['listings'].sum().to_frame(name = 'total_listings').sort_values(by='total_listings', ascending=False).reset_index()
#Calculate ratio of room types in the different neighbourhoods
#Merge with roomNeighbourhoodCount so the totals cover all listings, not only the filtered property types
roomRatio = roomCount.merge(roomNeighbourhoodCount, on='neighbourhood_cleansed')
roomRatio['ratio_of_room_type_in_neighbourhood'] = roomRatio['listings']/roomRatio['total_listings']*100
roomRatio.head(10)
df = pd.DataFrame()
df['avg_n_accommodations'] =listings.groupby('neighbourhood_cleansed').accommodates.mean()
df = df.reset_index()
df
neighbourhoodPricing = listings.groupby('neighbourhood_cleansed').price.mean().to_frame().sort_values(by='price', ascending=False).reset_index()
neighbourhoodPricing
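#Calculate Average Price Per Person
#Merge on the neighbourhood name: neighbourhoodPricing is sorted by price, so assigning by position would misalign the rows
df = df.merge(neighbourhoodPricing, on='neighbourhood_cleansed')
df['price_perPerson'] = df.price/df.avg_n_accommodations
df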
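Dividing the average price by the average capacity is not the same as averaging the price per person of each listing. A possible alternative, sketched here on made-up numbers, computes the per-person price for each listing first and then takes the mean per neighbourhood. The column name `price_per_person` is introduced only for this illustration.

```python
import pandas as pd

# Made-up listings with price and capacity
sample = pd.DataFrame({
    "neighbourhood_cleansed": ["Indre By", "Indre By", "Nørrebro", "Nørrebro"],
    "price": [1200.0, 400.0, 600.0, 300.0],
    "accommodates": [4, 1, 2, 2],
})

# Per-listing price per person
sample["price_per_person"] = sample["price"] / sample["accommodates"]

# Mean of the per-listing values, keyed by neighbourhood so rows cannot misalign
per_person = (
    sample.groupby("neighbourhood_cleansed")["price_per_person"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)
print(per_person)
```

For the made-up Indre By rows, the mean of the per-listing values is 350, whereas the ratio of the means would be 800 / 2.5 = 320, so the two definitions can give noticeably different rankings.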
PropertyPricing = listings.groupby('property_type').price.mean().to_frame().sort_values(by='price', ascending=False).reset_index()
PropertyPricing.head(20)
roomPricing = listings.groupby('room_type').price.mean().to_frame().sort_values(by='price', ascending=False).reset_index()
roomPricing.head(20)
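#Merge reviews and listings
#Note: `reviews` is the reviews dataset; it must be loaded into a dataframe before this step (the loading is not shown here)
group_listingReviews = reviews.merge(listings, on='listing_id', how='inner')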
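An inner join keeps only listings that have at least one review and repeats the listing's columns once per review. The toy frames below illustrate this behaviour; they are made up and only mimic the `listing_id` key.

```python
import pandas as pd

# Made-up listings; listing 3 has no reviews
toy_listings = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "listing_name": ["Cozy flat", "Sunny room", "New listing"],
})

# Made-up reviews; listing 1 has two reviews
toy_reviews = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "comments": ["Great stay", "Very clean", "Nice host"],
})

# Inner join on the shared key: listing 3 drops out, listing 1 appears twice
merged = toy_reviews.merge(toy_listings, on="listing_id", how="inner")
print(merged)
```

If one point per listing is wanted on a map, dropping duplicates on `listing_id` after the merge may be worth considering.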
#Define mapbox API token and style
mapbox_access_token = 'pk.eyJ1IjoiYWNodG9uMjExMSIsImEiOiJja2lyam5yemgyNTV0MnJsYmJ0NXdzNWRxIn0.rWJgur27hJnWoBt7Oq5LeQ'
px.set_mapbox_access_token(mapbox_access_token)
plot_style = 'mapbox://styles/achton2111/ckirsv5df0aj01at4zp0d7f3w'
#Interactive Geospatial plot
fig = px.scatter_mapbox(group_listingReviews,
                        lat="latitude",
                        lon="longitude",
                        color="neighbourhood_cleansed",
                        zoom=10,
                        size='price',
                        mapbox_style=plot_style,
                        hover_name='listing_name',
                        #A list keeps the hover fields in a fixed order
                        hover_data=['price',
                                    'property_type',
                                    'room_type',
                                    'accommodates',
                                    'beds',
                                    'review_scores_rating'],
                        opacity=0.8,
                        title='AirBnB Listing Locations (Coloured by Neighbourhood, Size by Price)'
                        )
fig.show()
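Besides browsing the map, the most expensive listings can also be pulled out as a small table. The snippet is a sketch on made-up rows, using `nlargest` to rank by price.

```python
import pandas as pd

# Made-up listings with prices in the cleaned numeric format
sample = pd.DataFrame({
    "listing_name": ["Cozy flat", "Sunny room", "Penthouse", "Houseboat"],
    "neighbourhood_cleansed": ["Indre By", "Nørrebro", "Indre By", "Amager Vest"],
    "price": [1200.0, 450.0, 5000.0, 2500.0],
})

# The two most expensive listings, highest price first
top = sample.nlargest(2, "price")
print(top[["listing_name", "neighbourhood_cleansed", "price"]])
```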